DSA Homework 5

Roger Jang


Due date: 20160522 23:59:59

English spelling checker

Problem definition

In this homework, you need to implement an English spelling checker (and corrector). In general, there are two types of spelling errors:

For simplicity, you only need to deal with non-word errors without context information in this homework. More specifically, given an English word (which might be mis-spelled), you need to check if it is listed in a dictionary. If no, you need to give suggestions as what are the most likely words that can be used to replace the original mis-spelled word.

For this homework, use the CMU pronuncing dictionary. Please read its introduction in the previous link, and download the lastest version of CMUDict-0.7b to check out its format. (Remember that we are only using the dictionary to check the spelling of a word. The pronunciation part is not used at all.)

For the suggestions to a mis-spelled word, you can first create a confusable set and then delete those not in the dictionary. For a given mis-spelled word of length $n$, the confusable set can be generated by the following operations (see Peter Norvig's spelling corrector):

  1. Insert: $26(n+1)$
    For instance: face --> {Xface|X=a~z} $\cup$ {fXace|X=a~z} $\cup$ {faXce|X=a~z} $\cup$ {facXe|X=a~z} $\cup$ {faceX|X=a~z}
  2. Delete: $n$
    For instance: face --> {ace, fce, fae, fac}
  3. Substitute: $26n$
    For instance: face --> {Xace|X=a~z} $\cup$ {fXce|X=a~z} $\cup$ {faXe|X=a~z} $\cup$ {facX|X=a~z}
  4. Transpose: $n-1$
    For instance: face --> {afce, fcae, faec}
This leads to a total of $54n+25$, with some duplicates. We can denote the above confusable set as $ED_1(string)$, which is set of words that have edit distance of 1 to the given $string$. For instance, the number of elements $ED_1('something')$ is 494 (with duplicate removed). Similarly, we can have $ED_2(string) = ED_1(ED_1(string))$, which is a set of words that have edit distance of 2 to the given $string$. This set is easy to write, but takes much longer to compute. For instance, the size of $ED_2('something')$ is 114,324 (with duplicates removed). Then we need to remove words not in the dictionary. If this step is denoted as $clean$, then our approach is like this:
  1. $x_1=ED_1('something')$ ==> $size(x_1)$=494
  2. $x_2=ED_1(x_1)$ ==> $size(x_2)$=114,324
  3. $out=clean(x_1 \cup x_2)$ ==> $size(out)$=5, that is, $out={'seething', 'smoothing', 'something', 'somethings', 'soothing'}$.
(Note that the edit distance used here has unconventional operators such as transposition. Other operators can be defined if necessary.)

The dictionary file is on linux1~linux13: /tmp2/dsa2016_hw5/cmudict-0.7b

An typical example of input file is as follows:

severly	xxx...
symetrical	xxx...
affets	xxx...
pinoneered	xxx...
xxyyzz	xxx...
happy	xxx...
$severely	xxx...
ever-day	xxx...
ma'amam
.
.
.
That is, each row starts with a correct or mis-spelled English word, possibly with some other irrelevant information (as denoted by "xxx..." in the previous example) separated by a tab or a space. In other words, you should read the word at the beginning of each line till the first space or tab, or till the end of line, whichever comes first. Please check out the input files of the following test cases to have a better understanding of the format.

The output corresponding to the previous example input is shown next:

severly ==> beverley beverly caverly cheverly cleverly ...
symetrical ==> asymmetrical metrical symmetrical
affets ==> affect affects assets buffets effects ...
pinoneered ==> pioneered
xxyyzz ==> NONE
happy ==> OK
$severely ==> severely
ever-day ==> eveready everyday
ma'amam ==> ma'am macadam manama
.
.
.
More about the format:
  1. The first word is the original one (either correct or mis-spelled).
  2. Everything after "==>" is the list of suggested words in the dictionary, in ascending alphabetic order, separated by a space.
  3. If the given word is correct (already in the dictionary), then the suggested word is "OK" alone, such as "happy ==> OK".
  4. If the given word is mis-spelled and we cannot find any suggested words, then the suggested word is "NONE" alone, such as "xxyyzz ==> NONE".
  5. If the given word has uppercase letters or non-alphabets (such as digits or dashes), just take it as it is to perform the operations.

Requirements & suggestions

Test cases

  1. Input file aspell-.31-normal.res.txt, output file aspell-.31-normal.res_output.txt
  2. Input file aspell-.33-normal.res.txt, output file aspell-.33-normal.res_output.txt
(The input files are obtained from the test data page of the aspell project.)

Additional comments after meeting with students at the basement on 2016/05/18

Submission guidelines

Please submit your code with GitHub as directed in the GitHub submission guide. Your directory structure (under GitHub repository) should be:
  1. hw5/*, your source code.
  2. hw5/Makefile, where the TAs can use the command ¡¨make¡¨ on CSIE R217 linux machines to compile your code, and ¡¨make run¡¨ to run your program.
You have to make sure your code can be compiled on CSIE R217 linux machines with your Makefile and run normally since we will test your program on CSIE R217 linux machines.